Add DoS protection and idle timeout to vMCP session manager#4047
Add DoS protection and idle timeout to vMCP session manager#4047
Conversation
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 88deb977d6
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
There was a problem hiding this comment.
Pull request overview
Implements resource-exhaustion protections for SessionManagementV2 by adding configurable session limits and idle-session cleanup to reduce risk of backend connection/memory/FD exhaustion.
Changes:
- Add per-client (identity subject) session limiting and idle-session tracking/reaping to
sessionmanager.Manager. - Add server-level global session limit middleware returning HTTP 503 +
Retry-Afterfor new sessions when capped. - Extend/adjust unit tests to cover per-client limits and idle reaping behaviors.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
pkg/vmcp/server/sessionmanager/session_manager.go |
Adds limits struct, per-client counters, idle tracking + reaper, and integrates cleanup into session termination/tool calls. |
pkg/vmcp/server/server.go |
Adds config knobs/defaults, global session-limit middleware (503 + Retry-After), and wires idle reaper lifecycle into server start/shutdown. |
pkg/vmcp/server/sessionmanager/session_manager_test.go |
Updates constructors for new Limits parameter and adds tests for per-client limit + idle reaper behavior. |
Comments suppressed due to low confidence (1)
pkg/vmcp/server/server.go:607
- The new session-limit middleware is applied before AuthMiddleware in code, but due to wrapper ordering the auth middleware becomes the outermost handler and runs first. This contradicts the comment and means over-limit requests will still pay auth cost. Apply AuthMiddleware first and then wrap with sessionLimitMiddleware (so the limit check runs before auth).
// Apply session limit middleware when V2 session management is active.
// Runs before auth so over-limit requests are rejected early without auth overhead.
if s.vmcpSessionMgr != nil && s.config.MaxSessions > 0 {
mcpHandler = s.sessionLimitMiddleware(mcpHandler)
slog.Info("session limit middleware enabled", "max_sessions", s.config.MaxSessions)
}
// Apply authentication middleware if configured (runs first in chain)
if s.config.AuthMiddleware != nil {
mcpHandler = s.config.AuthMiddleware(mcpHandler)
slog.Info("authentication middleware enabled for MCP endpoints")
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #4047 +/- ##
==========================================
- Coverage 69.13% 68.85% -0.28%
==========================================
Files 462 462
Lines 46554 46720 +166
==========================================
- Hits 32185 32171 -14
- Misses 11897 11916 +19
- Partials 2472 2633 +161 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
jerm-dro
left a comment
There was a problem hiding this comment.
I don't think these changes are required for the session v2 work. Would you be okay setting this work aside for now?
I think we'll tackle a related problem as part of the upcoming scale work, where we'll need to implement an LRU for the session storage so not all sessions are held in memory.
| // sessionLimitMiddleware is a best-effort fast-fail for new session requests | ||
| // (no Mcp-Session-Id header): it returns HTTP 503 + Retry-After before the | ||
| // request reaches the SDK when the global session cap appears to be reached. | ||
| // Existing sessions (with a valid Mcp-Session-Id) are never affected. | ||
| // | ||
| // This check is intentionally optimistic (non-atomic): it avoids the overhead | ||
| // of routing and SDK processing for clearly-over-limit requests, but it does | ||
| // not guarantee strict enforcement under concurrent load. Strict enforcement | ||
| // is provided atomically by sessionmanager.Manager.Generate(), which uses an | ||
| // increment-first reservation to prevent races between concurrent initialize | ||
| // requests. | ||
| func (s *Server) sessionLimitMiddleware(next http.Handler) http.Handler { | ||
| // Resolve the concrete manager once so we can call ActiveSessionCount(). | ||
| mgr, _ := s.vmcpSessionMgr.(*sessionmanager.Manager) | ||
| return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) { | ||
| if r.Header.Get("Mcp-Session-Id") == "" && mgr != nil { | ||
| if mgr.ActiveSessionCount() >= s.config.MaxSessions { | ||
| w.Header().Set("Retry-After", strconv.Itoa(defaultRetryAfterSeconds)) | ||
| w.Header().Set("Content-Type", "application/json") | ||
| w.WriteHeader(http.StatusServiceUnavailable) | ||
| _, _ = w.Write([]byte( | ||
| `{"error":{"code":-32000,"message":"Maximum concurrent sessions exceeded. ` + | ||
| `Please try again later or contact administrator."}}`, | ||
| )) | ||
| return | ||
| } | ||
| } | ||
| next.ServeHTTP(w, r) | ||
| }) | ||
| } |
There was a problem hiding this comment.
Blocker: can we remove the rate limiting behavior from this PR?
It's not really needed today, so I'd prefer to avoid adding it.
| // perClientMu guards perClientCounts and sessionSubject. | ||
| perClientMu sync.Mutex | ||
| perClientCounts map[string]int // subject → active session count | ||
| sessionSubject map[string]string // sessionID → subject (for decrement on Terminate) | ||
|
|
||
| // idleActivityMu guards idleActivity. | ||
| idleActivityMu sync.RWMutex | ||
| idleActivity map[string]time.Time // sessionID → last active time | ||
|
|
||
| // activeSessionCount tracks sessions that have been generated but not yet | ||
| // terminated, excluding terminated placeholders left for TTL cleanup. | ||
| // This gives an accurate count for global limit enforcement, unlike | ||
| // storage.Count() which includes those terminated placeholders. | ||
| activeSessionCount atomic.Int64 |
There was a problem hiding this comment.
Blocker: this is adding a lot of complexity to Manager. We should refactor the Manager struct first so we don't have to mix all these concerns together in one class. That would also make unit testing much simpler. I would expect a change like this doesn't require the integration_tests at all.
|
ok setting as draft for now. I know that is not essential but i was speeding up with pending tasks while the main code was reviewed... |
Implements resource-exhaustion protections for the session-scoped backend lifecycle (SessionManagementV2), resolving issue #3874. Global session limit: new session requests (no Mcp-Session-Id header) receive HTTP 503 with a Retry-After header when the server-wide cap is reached. Default: 100 sessions. Per-client session limit: CreateSession enforces a maximum number of concurrent sessions per auth.Identity.Subject. Anonymous clients are exempt. The counter is rolled back on all failure paths. Default: 10 sessions per identity. Idle session timeout: a background reaper goroutine terminates sessions that have had no CallTool activity for longer than the configured threshold. The idle clock resets on every tool call and is initialised when a session is fully established. The reaper is wired into shutdownFuncs so it stops cleanly on server shutdown. Default: 5 min. All three limits are configurable via Config fields; zero values select the defaults. The Limits struct is passed to sessionmanager.New() and all existing call sites are updated. Closes: #3874
|
@yrobla could you take a look at this draft PR and confirm if it’s still relevant? |
well, we agreed to implement it at some point, but it was not critical. If you don't mind, i prefer to keep them as draft, to resume the work when needed. |
|
So, I went through this pretty thoroughly and I have some thoughts. First off, the intent here is good... DoS protection for session management is something we'll want eventually. But there are some issues that I think need addressing before this can move forward. I agree with Jeremy's two blockers, and I found a few more things on top of those. The Manager is already doing too much Jeremy flagged this and he's right. The The decomposition path is actually pretty clear:
Counter-drift bug in Generate() OK this one is subtle but real. When storage fails in if sm.limits.MaxSessions > 0 {
sm.activeSessionCount.Add(-1)
}If you incremented the counter, you need to decrement it on failure. Period. Making the rollback conditional on a config value that could theoretically change between the increment and the error path means the counter can drift permanently. Once it drifts, the global limit rejects new sessions forever until restart. The fix is simple... make the decrement unconditional. Also, the limited path increments before storage (reservation pattern) while the unlimited path increments after. That asymmetry is confusing and easy to break. Per-client map entries stick around at zero
The middleware splits admission logic across two layers The
Pick one place for the check. Either the middleware owns it entirely (and returns a proper 503), or Missing test coverage The tests that are here use good patterns (timestamp back-dating instead of
Codecov shows 68.7% patch coverage with 46 lines missing. Small thing: wrong change type The PR checks "Refactoring (no behavior change)" but adding session limits and idle timeout is a new feature. Not a big deal but worth fixing. Bottom line I think we should close this one out and reopen it as smaller, focused PRs once we've done the Manager decomposition. Something like:
That way each piece is testable and reviewable on its own. Thoughts? |
Summary
Implements resource-exhaustion protections for the session-scoped backend lifecycle (SessionManagementV2), resolving issue #3874.
Global session limit: new session requests (no Mcp-Session-Id header) receive HTTP 503 with a Retry-After header when the server-wide cap is reached. Default: 100 sessions.
Per-client session limit: CreateSession enforces a maximum number of concurrent sessions per auth.Identity.Subject. Anonymous clients are exempt. The counter is rolled back on all failure paths. Default: 10 sessions per identity.
Idle session timeout: a background reaper goroutine terminates sessions that have had no CallTool activity for longer than the configured threshold. The idle clock resets on every tool call and is initialised when a session is fully established. The reaper is wired into shutdownFuncs so it stops cleanly on server shutdown. Default: 5 min.
All three limits are configurable via Config fields; zero values select the defaults. The Limits struct is passed to sessionmanager.New() and all existing call sites are updated.
Fixes #3874
Type of change
Test plan
task test)task test-e2e)task lint-fix)Does this introduce a user-facing change?
No
Special notes for reviewers